HPSG-Style Underspecified Japanese Grammar with Wide Coverage
Abstract
This paper describes a wide-coverage Japanese grammar based on HPSG. The aim of this work is to see the coverage and accuracy attainable using an underspecified grammar. Underspecification, allowed in a typed feature structure formalism, enables us to write down a wide-coverage grammar concisely. The grammar we have implemented consists of only 6 ID schemata, 68 lexical entries (assigned to functional words), and 63 lexical entry templates (assigned to parts of speech (POSs)). Furthermore, word-specific constraints such as subcategorization of verbs are not fixed in the grammar. However, this grammar can generate parse trees for 87% of the 10000 sentences in the Japanese EDR corpus. The dependency accuracy is 78% when a parser uses the heuristic that every bunsetsu [1] is attached to the nearest possible one.

1 Introduction

Our purpose is to design a practical Japanese grammar based on HPSG (Head-driven Phrase Structure Grammar) (Pollard and Sag, 1994), with wide coverage and reasonable accuracy for syntactic structures of real-world texts. In this paper, "coverage" refers to the percentage of input sentences for which the grammar returns at least one parse tree, and "accuracy" refers to the percentage of bunsetsus which are attached correctly. To realize wide coverage and reasonable accuracy, the following steps were taken:

A) At first we prepared a linguistically valid but coarse grammar with wide coverage.
B) We then refined the grammar in regard to accuracy, using practical heuristics which are not linguistically motivated.

As for A), the first grammar we have constructed actually consists of only 68 lexical entries (LEs) for some functional words [2], 63 lexical entry templates (LETs) for POSs [3], and 6 ID schemata. Nevertheless, the coverage of our grammar was 92% for the Japanese corpus in the EDR Electronic Dictionary (EDR, 1996), mainly due to underspecification, which is allowed in HPSG and does not always require detailed grammar descriptions.

As for B), in order to improve accuracy, the grammar should restrict ambiguity as much as possible. For this purpose, the grammar needs more constraints in itself. To reduce ambiguity, we added additional feature structures, which may not be linguistically valid but are empirically correct, as constraints to i) the original LEs and LETs, and ii) the ID schemata.

The rest of this paper describes the architecture of our Japanese grammar (Section 2), refinement of our grammar (Section 3), experimental results (Section 4), and discussion regarding errors (Section 5).

2 Architecture of Japanese Grammar

In this section we describe the architecture of the HPSG-style Japanese grammar we have developed. In the HPSG framework, a grammar consists of (i) immediate dominance schemata (ID schemata), (ii) principles, and (iii) lexical entries (LEs). All of them are represented by typed feature structures (TFSs) (Carpenter, 1992), the fundamental data structures of HPSG. ID schemata, corresponding to rewriting rules in CFG, are significant for constructing syntactic structures. The details of our ID schemata are discussed in Section 2.1. Principles are constraints between mother and daughter feature structures [4]. LEs, which compose the lexicon, are detailed constraints on each word. In our grammar, we do not always assign LEs to each word. Instead, we assign lexical entry ...

* This research is partially funded by the project of JSPS (JSPS-RFTF96P00502).
[1] A bunsetsu is a common unit when syntactic structures in Japanese are discussed.
[2] A functional word is assigned one or more LEs.
[3] A POS is also assigned one or more LETs.
[4] We omit further explanation about principles here due to limited space.
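The role of underspecification described above can be illustrated with a minimal sketch. It is not the authors' implementation: it models feature structures as untyped nested Python dictionaries (no type hierarchy, no structure sharing), and the feature names POS and CASE and the values "noun", "ga", and "wo" are assumptions chosen only for illustration. A feature that is absent from a dictionary is treated as underspecified, so a constraint that leaves the case marker open unifies with a phrase bearing any case, while a word-specific constraint rejects a mismatching phrase.

```python
def unify(a, b):
    """Return the unification of a and b, or None if they clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:
                    return None  # conflicting values under the same feature
                result[feature] = merged
            else:
                result[feature] = value  # only b constrains this feature
        return result
    # Atoms (or an atom against a dict) unify only when they are equal.
    return a if a == b else None


# Hypothetical underspecified constraint on a verb's complement:
# it must be a noun phrase, but the case marker is left open.
underspecified_comp = {"POS": "noun"}

# Hypothetical word-specific constraint that fixes the case to "ga".
specific_comp = {"POS": "noun", "CASE": "ga"}

# A noun phrase marked with "wo".
wo_phrase = {"POS": "noun", "CASE": "wo"}

print(unify(underspecified_comp, wo_phrase))  # {'POS': 'noun', 'CASE': 'wo'}
print(unify(specific_comp, wo_phrase))        # None
```

In this sketch the underspecified constraint accepts more inputs, which is the source of wide coverage, and for the same reason it licenses more analyses, which is why additional constraints are needed to reduce ambiguity.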
Similar resources
Corpus-Oriented Development of Japanese HPSG Parsers
This paper reports the corpus-oriented development of a wide-coverage Japanese HPSG parser. We first created an HPSG treebank from the EDR corpus by using heuristic conversion rules, and then extracted lexical entries from the treebank. The grammar developed using this method attained wide coverage that could hardly be obtained by conventional manual development. We also trained a statistical p...
Linking Flat Predicate Argument Structures
This report presents an approach to enriching flat and robust predicate argument structures with more fine-grained semantic information, extracted from underspecified semantic representations and encoded in Minimal Recursion Semantics (MRS). Such representations are provided by a hand-built HPSG grammar with a wide linguistic coverage. A specific semantic representation, called linked predicate...
HPSG Parsing with Shallow Dependency Constraints
We present a novel framework that combines strengths from surface syntactic parsing and deep syntactic parsing to increase deep parsing accuracy, specifically by combining dependency and HPSG parsing. We show that by using surface dependencies to constrain the application of wide-coverage HPSG rules, we can benefit from a number of parsing techniques designed for high-accuracy dependency parsing...
Head-Initial Constructions in Japanese
Japanese is often taken to be strictly head-final in its syntax. In our work on a broad-coverage, precision implemented HPSG for Japanese, we have found that while this is generally true, there are nonetheless a few minor exceptions to the broad trend. In this paper, we describe the grammar engineering project, present the exceptions we have found, and conclude that this kind of phenomenon moti...
Efficient Deep Processing of Japanese
We present a broad coverage Japanese grammar written in the HPSG formalism with MRS semantics. The grammar is created for use in real world applications, such that robustness and performance issues play an important role. It is connected to a POS tagging and word segmentation tool. This grammar is being developed in a multilingual context, requiring MRS structures that are easily comparable acr...